
    Exploiting loop level parallelism in nonprocedural dataflow programs

    This paper discusses how loop-level parallelism is detected in a nonprocedural dataflow program, and how a procedural program with concurrent loops is scheduled. Also discussed is a program restructuring technique that can be applied to recursive equations so that concurrent loops can be generated for a seemingly iterative computation. A compiler that generates C code for the language described in the paper has been implemented; the scheduling component of the compiler and the restructuring transformation are described.
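
    The restructuring idea lends itself to a small illustration. In a recurrence such as a[i][j] = a[i-1][j] + a[i-1][j-1], each row depends only on the previous one, so once the rows are sequenced, every element within a row can be computed concurrently. The hand-written C/OpenMP sketch below shows the kind of concurrent loop such a restructuring could expose; it illustrates the technique and is not output of the compiler described in the paper.

        #include <omp.h>

        #define N 1024

        static double a[N][N];   /* row 0 and column 0 assumed initialized */

        void compute(void)
        {
            for (int i = 1; i < N; i++) {     /* rows must run in order */
                #pragma omp parallel for      /* columns within a row are independent */
                for (int j = 1; j < N; j++)
                    a[i][j] = a[i-1][j] + a[i-1][j-1];
            }
        }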

    Parallel scheduling of recursively defined arrays

    This paper describes a new method for the automatic generation of concurrent programs that construct arrays defined by sets of recursive equations. We assume that the computation time of an array element is a linear combination of its indices, and we use integer programming to seek a succession of hyperplanes along which array elements can be computed concurrently. The method can be used to schedule equations involving variable-length dependency vectors and mutually recursive arrays. Portions of the work reported here have been implemented in the PS automatic program generation system.
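
    For a concrete sense of the hyperplane idea: if an element depends on a[i-1][j] and a[i][j-1], then all elements on the hyperplane i + j = k are mutually independent, so the anti-diagonals can be visited one after another with full parallelism inside each. The C/OpenMP sketch below hand-codes the schedule that such a search would find for this simple dependency set; it is an illustration, not the PS system itself.

        #include <omp.h>

        #define N 1024

        static double a[N][N];   /* boundary row and column assumed initialized */

        void wavefront(void)
        {
            /* hyperplanes i + j = k, visited in order */
            for (int k = 2; k <= 2 * (N - 1); k++) {
                int lo = (k - (N - 1) > 1) ? k - (N - 1) : 1;
                int hi = (k - 1 < N - 1) ? k - 1 : N - 1;
                #pragma omp parallel for      /* elements on one hyperplane */
                for (int i = lo; i <= hi; i++) {
                    int j = k - i;
                    a[i][j] = a[i-1][j] + a[i][j-1];  /* example recurrence */
                }
            }
        }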

    Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems

    Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving: diverse workloads lead to a need for configurable memory resources to achieve high performance and utilization. In this study, we evaluate a memory subsystem design leveraging CXL-enabled memory pooling. Two promising use cases of composable memory subsystems are studied -- fine-grained capacity provisioning and scalable bandwidth provisioning. We developed an emulator to explore the performance impact of various memory compositions, and we provide a profiler to identify memory usage patterns in applications and their optimization opportunities. Seven scientific and six graph applications are evaluated on various emulated memory configurations. Three of the seven scientific applications had less than 10% performance impact when the pooled memory backed 75% of their memory footprint. The results also show that a dynamically configured high-bandwidth system can effectively support bandwidth-intensive unstructured mesh-based applications like OpenFOAM. Finally, we identify interference through shared memory pools as a practical challenge for adoption on HPC systems.
    Comment: 10 pages, 13 figures. Accepted for publication in the Workshop on Memory Centric High Performance Computing (MCHPC'22) at SC22.
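
    The paper's emulator is not reproduced here, but a common way to approximate a pooled, farther memory tier on a stock NUMA machine is to bind part of an application's working set to a remote node. The C sketch below (using libnuma; the 75%/25% split mirrors the capacity scenario in the abstract) is an assumption-laden illustration of that general approach, not the authors' tool.

        #include <numa.h>     /* link with -lnuma */
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            if (numa_available() < 0) {
                fprintf(stderr, "NUMA not available\n");
                return 1;
            }

            size_t total = 1UL << 30;          /* 1 GiB working set */
            size_t local = total / 4;          /* hot 25% stays node-local */
            int far_node = numa_max_node();    /* stand-in for the pooled tier */

            char *hot  = numa_alloc_onnode(local, 0);
            char *cold = numa_alloc_onnode(total - local, far_node);
            if (!hot || !cold)
                return 1;

            memset(hot, 0, local);             /* touch pages to commit them */
            memset(cold, 0, total - local);

            numa_free(hot, local);
            numa_free(cold, total - local);
            return 0;
        }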

    PS: A NONPROCEDURAL LANGUAGE WITH DATA TYPES AND MODULES

    The Problem Specification (PS) nonprocedural language is a very high level language for algorithm specification. PS is suitable for nonprogrammers, who can specify a problem using mathematically oriented equations; for expert programmers, who can prototype different versions of a software system for evaluation; and for those who wish to use specifications for portions (if not all) of a program. PS has data types and modules similar to Modula-2, and the compiler generates C code. In this paper, we first show PS by example, and then discuss efficiency issues in scheduling and code generation.
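
    Since PS source is not shown in this listing, the sketch below pairs a PS-like equational specification (hypothetical syntax, invented purely for illustration) with the straightforward C that a scheduling compiler could emit for it.

        /* Hypothetical equational specification, PS-style syntax invented
         * for illustration:
         *
         *     s(0) = 0;
         *     s(i) = s(i-1) + x(i),  1 <= i <= n;
         *
         * One schedule satisfying the dependencies is i increasing, which
         * a compiler could emit as: */
        void prefix_sum(const double *x, double *s, int n)
        {
            s[0] = 0.0;
            for (int i = 1; i <= n; i++)
                s[i] = s[i-1] + x[i];
        }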

    Promises and Pitfalls of Reconfigurable Supercomputing

    Reconfigurable supercomputing (RSC) combines programmable logic chips with high performance microprocessors, all communicating over a high bandwidth, low latency interconnection network. Reconfigurable hardware has demonstrated an order of magnitude speedup on compute-intensive kernels in science and engineering. However, translating high level algorithms to programmable hardware is a formidable barrier to the use of these resources by scientific programmers. A library-based approach has been suggested, so that the software application can call standard library functions that have been optimized for hardware. The potential benefits of this approach are evaluated on several large scientific supercomputing applications. It is found that hardware linear algebra libraries would be of little benefit to the applications analyzed. To maximize performance of supercomputing applications on RSC, it is necessary to identify kernels of high computational density that can be mapped to hardware, to carefully partition software and hardware to reduce communication overhead, and to optimize memory bandwidth on the FPGAs. Two case studies that follow this approach are summarized, and, based on experience with these applications, directions for future reconfigurable supercomputing architectures are outlined.
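
    A rough proxy for the "computational density" criterion is arithmetic intensity: floating-point operations per byte moved between host and accelerator. The C sketch below computes this proxy for two textbook kernels; the numbers are illustrative and are not taken from the paper.

        #include <stdio.h>

        static double intensity(double flops, double bytes)
        {
            return flops / bytes;
        }

        int main(void)
        {
            double n = 4096.0;
            /* daxpy: 2n flops over ~3n doubles moved -> memory bound */
            printf("daxpy: %.3f flops/byte\n", intensity(2.0 * n, 24.0 * n));
            /* dgemm: 2n^3 flops over ~3n^2 doubles moved -> compute bound */
            printf("dgemm: %.3f flops/byte\n",
                   intensity(2.0 * n * n * n, 24.0 * n * n));
            return 0;
        }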

    Accelerating a random forest classifier: multi-core, GP-GPU, or FPGA?

    Random forest classification is a well known machine learning technique that generates classifiers in the form of an ensemble ("forest") of decision trees. The classification of an input sample is determined by a majority vote of the ensemble. Traditional random forest classifiers can be highly effective, but classification using a random forest is memory bound and not typically suitable for acceleration using FPGAs or GP-GPUs due to the need to traverse large, possibly irregular decision trees. Recent work at Lawrence Livermore National Laboratory has developed several variants of random forest classifiers, including the Compact Random Forest (CRF), that can generate decision trees more suitable for acceleration than traditional decision trees. Our paper compares and contrasts the effectiveness of FPGAs, GP-GPUs, and multi-core CPUs for accelerating classification using models generated by compact random forest machine learning classifiers. Taking advantage of training algorithms that can produce compact random forests composed of many small trees rather than fewer deep trees, we are able to regularize the forest so that the classification of any sample takes a deterministic amount of time. This optimization allows us to execute the classifier in a pipelined or single-instruction multiple-thread (SIMT) fashion. We show that FPGAs provide the highest-performance solution but require a multi-chip, multi-board system to execute even modest sized forests. GP-GPUs offer a more flexible solution with reasonably high performance that scales with forest size. Finally, multi-threading via OpenMP on a shared memory system was the simplest solution and provided near linear performance that scaled with core count, but was still significantly slower than the GP-GPU and FPGA solutions.
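
    The deterministic-time property comes from padding every tree to a fixed depth and storing it as an array with implicit child indexing, so classifying a sample performs exactly DEPTH comparisons per tree regardless of the input. The C sketch below illustrates that layout under assumed parameters (binary classification, depth 6, 256 trees); it sketches the compact-random-forest idea, not LLNL's exact data structure.

        #include <stdint.h>

        #define DEPTH 6                      /* all trees padded to equal depth */
        #define NODES ((1 << DEPTH) - 1)     /* internal nodes per tree */
        #define TREES 256

        struct node { uint16_t feature; float threshold; };

        static struct node forest[TREES][NODES];
        static uint8_t     leaf_class[TREES][1 << DEPTH];

        int classify(const float *sample)
        {
            int votes[2] = { 0, 0 };                  /* binary classification */
            for (int t = 0; t < TREES; t++) {
                int idx = 0;
                for (int d = 0; d < DEPTH; d++) {     /* fixed trip count */
                    const struct node *n = &forest[t][idx];
                    int right = sample[n->feature] > n->threshold;
                    idx = 2 * idx + 1 + right;        /* implicit child indexing */
                }
                votes[leaf_class[t][idx - NODES]]++;  /* idx now names a leaf */
            }
            return votes[1] > votes[0];               /* majority vote */
        }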